Don't forget to press the Trust button so the interactive maps render!
In this project we analyze and make use of crime data from the city of Los Angeles (link). We first explore the data and look for interesting insights regarding the types of crimes, the victims' demographics, the weapons used in the crimes and more. We then develop a few simple tools that law enforcement agencies may use to discover similarities between crime events and to predict the weapon used in a crime event.
import pandas as pd
import numpy as np
import math
import csv
import os
import re
import time
from typing import Dict, List, Tuple
from os.path import exists
from datetime import datetime
from dateutil import parser
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import plotly_express as px
import cufflinks as cf
import geopandas as gpd
from shapely.geometry import Point
from IPython.display import IFrame
sns.set()
%load_ext autoreload
%autoreload 2
%matplotlib inline
In the following section we perform an extensive data exploration in order to verify the quality of the data, remove outliers and obtain interesting insights.
We load the dataset of crime events in the city of Los Angeles (LA) from 2010 to the present. We also load and use a GeoJSON file containing shape polygons of the different neighborhoods of LA (obtained from here). We remove outliers (e.g., victims with an age of 0), re-organize the date and time columns, determine the neighborhood each crime took place in, and more.
data = pd.read_csv("Crime_Data_from_2010_to_Present.csv")
gdf = gpd.read_file(r'neighborhood_councils_losangeles.geojson')
gdf.head(3)
def get_neighborhood(row):
    """
    Determine the neighborhood a point is located in.
    Args:
        param: row. a row with LAT, LON coordinates
    Returns:
        The neighborhood name, or None if the point belongs to none
    """
    point = Point(row['LON'], row['LAT'])
    for _, r in gdf.iterrows():
        if r['geometry'].contains(point):
            return r['name']
    return None
def data_preprocessing(data):
    """
    Data preprocessing and outlier removal for the remainder of the project.
    Args:
        param: data. dataframe.
    Returns:
        The processed dataframe
    """
    data = data[data['Vict Age'] > 0]
    data['Vict Sex'] = data['Vict Sex'].apply(lambda x: 1 if x == 'M' else 0 if x == 'F' else math.nan)
    data['Date Occ Only'] = data['DATE OCC'].apply(lambda x: parser.parse(x).date())
    data['Date Occ Year'] = data['Date Occ Only'].apply(lambda x: x.year)
    data['Month'] = data['Date Occ Only'].apply(lambda x: x.month)
    # TIME OCC is an integer such as 5, 45, 830 or 1430; pad it into an 'HH:MM' string
    data['Time'] = data['TIME OCC'].apply(lambda x: '00:0'+str(x) if len(str(x))==1 else '00:'+str(x) if len(str(x))==2 else '0'+str(x)[0]+':'+str(x)[1:] if len(str(x))==3 else str(x)[0:2]+':'+str(x)[2:4])
    data['Time'] = data['Time'].apply(lambda x: datetime.strptime(x, '%H:%M').time())
    data['Hour'] = data['Time'].apply(lambda x: x.hour)
    days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
    data['Week Day'] = data['Date Occ Only'].apply(lambda x: days[x.weekday()])
    Descent = {'A': 'Other Asian', 'B': 'Black', 'C': 'Chinese', 'D': 'Cambodian', 'F': 'Filipino', 'G': 'Guamanian',
               'H': 'Hispanic/Latin/Mexican', 'I': 'American Indian/Alaskan Native', 'J': 'Japanese', 'K': 'Korean',
               'L': 'Laotian', 'O': 'Other', 'P': 'Pacific Islander', 'S': 'Samoan', 'U': 'Hawaiian', 'V': 'Vietnamese',
               'W': 'White', 'X': 'Unknown', 'Z': 'Asian Indian'}
    data['Vict Descent'] = data['Vict Descent'].apply(lambda x: Descent[x] if x in Descent else x)
    Status = {'Invest Cont': 'Investigation Continues', 'Juv Arrest': 'Juvenile Arrest', 'Juv Other': 'Juvenile Other', 'UNK': 'Unknown'}
    data['Status Desc'] = data['Status Desc'].apply(lambda x: Status[x] if x in Status else x)
    data[["Date Occ Year", "Hour"]] = data[["Date Occ Year", "Hour"]].apply(pd.to_numeric)
    data['neighborhood'] = data.apply(get_neighborhood, axis=1)
    return data
data = data_preprocessing(data)
data.to_csv('neighb_preprocessed_data.csv')
data.head()
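The padding of the integer `TIME OCC` column inside `data_preprocessing` can equivalently be done with `str.zfill`, which may be easier to read than the chained conditional. A minimal sketch on toy values, assuming `TIME OCC` holds integers like 5, 45, 830 or 1430:

```python
from datetime import datetime

# zfill(4) left-pads to 4 digits, giving an HHMM string directly.
for t in [5, 45, 830, 1430]:
    hhmm = str(t).zfill(4)                        # '0005', '0045', '0830', '1430'
    parsed = datetime.strptime(hhmm, '%H%M').time()
    print(t, '->', parsed)                        # e.g. 1430 -> 14:30:00
```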
Display basic statistics of the numeric columns of the dataset
data.describe().round(2)
data_missing = data.isna()
data_missing_numbers = data_missing.sum()
data_missing_numbers = data_missing_numbers.sort_values()
f, ax1 = plt.subplots(1, 1, sharex=True, figsize=(10, 5))
data_missing_numbers.plot.bar()
We can see that most columns have no missing data. Missing values in the columns 'Crm Cd 2', 'Crm Cd 3' and 'Crm Cd 4' make sense, since their description says they "May contain a code for an additional crime, less serious than Crime Code 1." As for the missing weapon values: a weapon is not used, or not known, in every crime.
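The missing-value bar plot above can also be read off numerically as a fraction per column; a minimal sketch on a toy frame (the values here are stand-ins, not from the dataset):

```python
import pandas as pd
import numpy as np

# Toy frame: 'Weapon Desc' is missing where no weapon was used or known.
df = pd.DataFrame({'Crm Cd Desc': ['BATTERY', 'VANDALISM', 'THEFT', 'ASSAULT'],
                   'Weapon Desc': ['STRONG-ARM', np.nan, np.nan, 'KNIFE'],
                   'Crm Cd 2': [np.nan, np.nan, np.nan, 998.0]})
missing_frac = df.isna().mean().sort_values(ascending=False)
print(missing_frac)  # Crm Cd 2: 0.75, Weapon Desc: 0.5, Crm Cd Desc: 0.0
```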
First, let's see the number of crime types that exist in the dataset, using the Crm Cd Desc column:
print('There are {} crime types'.format(len(data['Crm Cd Desc'].unique())))
Since the number of types is quite large and some types are divided into multiple sub-types (e.g. THEFT-GRAND, PETTY THEFT, ...), we further analyze the types and see which words appear most frequently in the crime-type descriptions, so we can obtain the more general types of crimes:
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
from collections import Counter
%matplotlib inline
def draw_word_cloud(words_list, min_times=10):
    """
    Draws a wordcloud
    Args:
        param1: words_list. list of the words.
        param2: min_times. minimum word count needed to display the word.
    """
    stopwords = set(STOPWORDS) | {"doc", "date", "memo", "subject", 'state'}
    stopwords_parts = {"'s", " ' s'", " `s"}
    wordcloud = WordCloud(width=800, height=800,
                          background_color='white',
                          stopwords=stopwords,
                          min_font_size=10)

    def skip_entity(e):
        if e in stopwords:
            return True
        for p in stopwords_parts:
            if p in e:
                return True
        return False

    c = Counter(words_list)
    # keep only frequent, non-stopword entries and use their frequencies
    d = {k: v for k, v in dict(c).items() if v > min_times and not skip_entity(k)}
    wordcloud.generate_from_frequencies(d)
    plt.figure(figsize=(10, 20), facecolor=None)
    plt.imshow(wordcloud)
find_most_common_Crimes = []
for row in data['Crm Cd Desc']:
    for word in row.split(' '):
        if len(word) > 3:
            find_most_common_Crimes.append(word)
draw_word_cloud(find_most_common_Crimes, min_times=20)
Just by looking at the crime-description word cloud we can learn what the main crimes are: assault, theft, burglary and vandalism. We continue the analysis by identifying the top crime types and mapping each crime event to one of three more general crime types: THEFT (which also includes burglary), ASSAULT and VANDALISM.
Let's first see what the 15 most common crime types in the dataset are:
top_crime_types = pd.DataFrame(data['Crm Cd Desc'].value_counts()[:15]).reset_index().rename(columns={"index": "crime type", 'Crm Cd Desc': 'count'})
f, ax1 = plt.subplots(1, 1, sharex=True, figsize=(15, 7))
g = sns.barplot(x=top_crime_types['count'],y=top_crime_types['crime type'], palette="rocket")
f.legend(loc = 'right')
g.set_title("Crime Counts per top 15 Types", fontsize=25)
# g.set_ylabel("Event Count",fontsize=15)
g.set_xlabel("Crime Type Count",fontsize=15)
g.tick_params(labelsize=15)
top_crime_types['count'].sum()/data.shape[0]
We can see that the top 15 types (out of 139) constitute 81.16 percent of the crimes. From each of the 15 types we derive the more general crime type (one of the three) and add it as an additional column to the table:
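The coverage figure above comes from summing `value_counts`; with `normalize=True` and `cumsum` the coverage of the top-k types can be read off directly. A minimal sketch on toy labels:

```python
import pandas as pd

# Six toy events: share of each type, then cumulative coverage of the top-k types.
s = pd.Series(['THEFT', 'THEFT', 'THEFT', 'ASSAULT', 'ASSAULT', 'VANDALISM'])
coverage = s.value_counts(normalize=True).cumsum()
print(coverage)  # THEFT 0.5, ASSAULT ~0.833, VANDALISM 1.0
```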
top_crime_list = top_crime_types['crime type'].tolist()
assault_crimes = [i for i in top_crime_list if 'ASSAULT' in i]
theft_crimes = [i for i in top_crime_list if ('THEFT' in i or 'ROBBERY' in i or 'BURGLARY' in i) and not i=='THEFT OF IDENTITY']
vandalism_crimes = [i for i in top_crime_list if 'VANDALISM' in i]
top_data = data[data['Crm Cd Desc'].isin(theft_crimes+assault_crimes+vandalism_crimes)][['Crm Cd Desc', 'Date Occ Year', 'Month', 'LAT', 'LON']]
def get_type(code):
    """
    Return the general type of a crime code.
    Args:
        param: code. the code of the crime.
    Returns:
        The general crime type
    """
    if code in assault_crimes:
        return 'ASSAULT'
    elif code in theft_crimes:
        return 'THEFT'
    elif code in vandalism_crimes:
        return 'VANDALISM'
    else:
        return 'OTHER'
top_data['Crime Type'] = top_data['Crm Cd Desc'].apply(lambda x: get_type(x))
top_data = top_data[top_data['Crime Type'] != 'OTHER']
top_data.head()
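The per-row `get_type` above can also be vectorized with `np.select` on substring matches, which scales better on millions of rows. A hedged sketch with the same mapping on hypothetical toy descriptions, keeping the identity-theft exclusion:

```python
import numpy as np
import pandas as pd

desc = pd.Series(['BURGLARY FROM VEHICLE',
                  'INTIMATE PARTNER - SIMPLE ASSAULT',
                  'VANDALISM - MISDEMEANOR',
                  'THEFT OF IDENTITY'])
# np.select takes the first matching condition, mirroring the if/elif chain.
conditions = [desc.str.contains('ASSAULT'),
              desc.str.contains('THEFT|ROBBERY|BURGLARY') & (desc != 'THEFT OF IDENTITY'),
              desc.str.contains('VANDALISM')]
crime_type = np.select(conditions, ['ASSAULT', 'THEFT', 'VANDALISM'], default='OTHER')
print(crime_type)  # ['THEFT' 'ASSAULT' 'VANDALISM' 'OTHER']
```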
Crime type map: We use the resulting table to visualize on a map of LA the crimes by their types. We define a function that given a year and a month, plots all the crimes by their type on the LA map. We use the Folium package for that purpose:
import folium
from branca.element import Template, MacroElement

def crime_type_map(crime_data, year, month):
    """
    Plots LA map with dots representing crimes by their general type.
    Args:
        param1: crime_data.
        param2: year. The year we want to filter the crime_data by.
        param3: month. The month we want to filter the crime_data by.
    """
    crime_data = crime_data[crime_data['Date Occ Year'] == year].reset_index(drop=True)
    crime_data = crime_data[crime_data['Month'] == month].reset_index(drop=True)
    m = folium.Map(location=[34.052, -118.2437], zoom_start=11, prefer_canvas=True)
    colors = {'THEFT': 'green', 'ASSAULT': 'red', 'VANDALISM': 'blue'}
    for i in range(len(crime_data)):
        row = crime_data.iloc[i]
        color = colors.get(row['Crime Type'])
        if color is not None:
            folium.Circle([row['LAT'], row['LON']], popup=row['Crm Cd Desc'],
                          radius=0.8, color=color, fill_color=color).add_to(m)
    template = """
{% macro html(this, kwargs) %}
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>jQuery UI Draggable - Default functionality</title>
<link rel="stylesheet" href="//code.jquery.com/ui/1.12.1/themes/base/jquery-ui.css">
<script src="https://code.jquery.com/jquery-1.12.4.js"></script>
<script src="https://code.jquery.com/ui/1.12.1/jquery-ui.js"></script>
<script>
$( function() {
$( "#maplegend" ).draggable({
start: function (event, ui) {
$(this).css({
right: "auto",
top: "auto",
bottom: "auto"
});
}
});
});
</script>
</head>
<body>
<div id='maplegend' class='maplegend'
style='position: absolute; z-index:9999; border:2px solid grey; background-color:rgba(255, 255, 255, 0.8);
border-radius:6px; padding: 10px; font-size:14px; right: 20px; bottom: 450px;'>
<div class='legend-title'>Crime Type</div>
<div class='legend-scale'>
<ul class='legend-labels'>
<li><span style='background:red;opacity:0.7;'></span>ASSAULT</li>
<li><span style='background:blue;opacity:0.7;'></span>VANDALISM</li>
<li><span style='background:green;opacity:0.7;'></span>THEFT</li>
</ul>
</div>
</div>
</body>
</html>
<style type='text/css'>
.maplegend .legend-title {
text-align: left;
margin-bottom: 5px;
font-weight: bold;
font-size: 90%;
}
.maplegend .legend-scale ul {
margin: 0;
margin-bottom: 5px;
padding: 0;
float: left;
list-style: none;
}
.maplegend .legend-scale ul li {
font-size: 80%;
list-style: none;
margin-left: 0;
line-height: 18px;
margin-bottom: 2px;
}
.maplegend ul.legend-labels li span {
display: block;
float: left;
height: 16px;
width: 30px;
margin-right: 5px;
margin-left: 0;
border: 1px solid #999;
}
.maplegend .legend-source {
font-size: 80%;
color: #777;
clear: both;
}
.maplegend a {
color: #777;
}
</style>
{% endmacro %}"""
    macro = MacroElement()
    macro._template = Template(template)
    m.get_root().add_child(macro)
    return m
m = crime_type_map(top_data, year=2019, month=6)
m.save('map.html')
IFrame(src='map.html', width=1000, height=600)
We can see that in June 2019 all three types of crimes occurred in all areas of LA, but some areas, such as Beverly Hills and West Hollywood, were more affected by theft than by assault or vandalism, while assaults were more common in central LA.
Analysis of the information about the victims of the crimes
Let's look at the distribution plots of crime counts by victim age, comparing male vs. female victims:
f, ax1 = plt.subplots(1, 1, sharex=True)
g = sns.distplot(data[data['Vict Sex'] == 1]['Vict Age'], axlabel="Vict Age", color='g', label='Male')
g = sns.distplot(data[data['Vict Sex'] == 0]['Vict Age'], axlabel="Vict Age", color='r', label='Female')
f.legend(loc = 'right')
g.set_title("Crime Events per Victim Sex and Victim Age", fontsize=20)
g.set_ylabel("Event Count",fontsize=15)
g.set_xlabel("Victim Age",fontsize=15)
g.tick_params(labelsize=10)
We can see similar event distributions for male and female crime victims by age, and that most victims are in their twenties.
The dataset also contains the victim's descent for each crime. Here we plot the crime count by victim descent:
f, ax1 = plt.subplots(1, 1, sharex=True, figsize=(15, 7))
group = data.groupby(["Vict Descent"])['Vict Descent'].agg(['count'])
group = group.reset_index(level=['Vict Descent'])
group = group.sort_values("count", ascending=False).head(10)
g = sns.barplot(x=group['count'],y=group['Vict Descent'], palette="rocket")
f.legend(loc = 'right')
g.set_title("Crime Events per top 10 Victim Descent", fontsize=25)
g.set_ylabel("Event Count",fontsize=15)
g.set_xlabel("Victim Descent",fontsize=15)
g.tick_params(labelsize=15)
We can see that most victims are of Hispanic/Latin/Mexican descent.
To find out whether the location of a crime might indicate the descent of the victim, we visualize the crime events on a map, colored by the victims' descent. We decided to focus on the most severe crime in the dataset, criminal homicide. To simplify the data we grouped Chinese, Korean and Filipino victims under Asian:
homicides_data = data[data['Crm Cd Desc']=='CRIMINAL HOMICIDE'][['LAT', 'LON', 'Vict Descent', 'Vict Sex']]
homicides_data = homicides_data.replace(['Other Asian','Korean', 'Filipino', 'Chinese'], 'ASIAN')
homicides_data = homicides_data[homicides_data['Vict Descent']!='Other'].reset_index(drop=True)
m = folium.Map(location=[34.022, -118.2437], zoom_start = 10.45, prefer_canvas=True)
females = folium.FeatureGroup("Females")
men = folium.FeatureGroup("Men")
# map each descent to a marker color; fill_color now matches the outline color
descent_colors = {'Hispanic/Latin/Mexican': 'blue', 'Black': 'green', 'White': 'red', 'ASIAN': 'black'}
for i in range(len(homicides_data)):
    row = homicides_data.iloc[i]
    color = descent_colors.get(row['Vict Descent'])
    if color is None:
        continue
    circle = folium.Circle([row['LAT'], row['LON']], radius=0.25, color=color, fill_color=color)
    if row['Vict Sex'] == 1:
        men.add_child(circle)
    else:
        females.add_child(circle)
men.add_to(m)
females.add_to(m)
folium.LayerControl(collapsed=False).add_to(m)
template = """
{% macro html(this, kwargs) %}
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>jQuery UI Draggable - Default functionality</title>
<link rel="stylesheet" href="//code.jquery.com/ui/1.12.1/themes/base/jquery-ui.css">
<script src="https://code.jquery.com/jquery-1.12.4.js"></script>
<script src="https://code.jquery.com/ui/1.12.1/jquery-ui.js"></script>
<script>
$( function() {
$( "#maplegend" ).draggable({
start: function (event, ui) {
$(this).css({
right: "auto",
top: "auto",
bottom: "auto"
});
}
});
});
</script>
</head>
<body>
<div id='maplegend' class='maplegend'
style='position: absolute; z-index:9999; border:2px solid grey; background-color:rgba(255, 255, 255, 0.8);
border-radius:6px; padding: 10px; font-size:14px; right: 10px; bottom: 350px;'>
<div class='legend-title'>Victim Descent</div>
<div class='legend-scale'>
<ul class='legend-labels'>
<li><span style='background:red;opacity:0.7;'></span>White</li>
<li><span style='background:blue;opacity:0.7;'></span>Latin</li>
<li><span style='background:green;opacity:0.7;'></span>Black</li>
<li><span style='background:black;opacity:0.7;'></span>Asian</li>
</ul>
</div>
</div>
</body>
</html>
<style type='text/css'>
.maplegend .legend-title {
text-align: left;
margin-bottom: 5px;
font-weight: bold;
font-size: 90%;
}
.maplegend .legend-scale ul {
margin: 0;
margin-bottom: 5px;
padding: 0;
float: left;
list-style: none;
}
.maplegend .legend-scale ul li {
font-size: 80%;
list-style: none;
margin-left: 0;
line-height: 18px;
margin-bottom: 2px;
}
.maplegend ul.legend-labels li span {
display: block;
float: left;
height: 16px;
width: 64px;
margin-right: 5px;
margin-left: 0;
border: 1px solid #999;
}
.maplegend .legend-source {
font-size: 80%;
color: #777;
clear: both;
}
.maplegend a {
color: #777;
}
</style>
{% endmacro %}"""
macro = MacroElement()
macro._template = Template(template)
m.get_root().add_child(macro)
m
Unfortunately, we can see a clear division of the homicide locations by the descent of the victims. This might indicate that the LA population is divided by descent into different areas of the city. We can also see that the White population is much less affected by murders than the Black or Latin populations.
(1) Crime counts by weapon type, and (2) crime counts by weapon with respect to year, hour of the day and victim gender, interactively:
f, ax1 = plt.subplots(1, 1, sharex=True, figsize=(15, 7))
group = data.groupby(["Weapon Desc"])['Weapon Desc'].agg(['count'])
group = group.reset_index(level=['Weapon Desc'])
group = group.sort_values("count", ascending=False).head(10)
Weapon_most_common = group['Weapon Desc'].to_list()
g = sns.barplot(x=group['count'],y=group['Weapon Desc'], palette="rocket")
f.legend(loc = 'right')
g.set_title("Crime Events per Top 10 Weapons", fontsize=25)
g.set_ylabel("Event Count",fontsize=15)
g.set_xlabel("Weapon",fontsize=15)
g.tick_params(labelsize=15)
We can see that the most common "weapons" are bodily force (hands/fists) and verbal threats.
Weapon_most_common_ds = data[data['Weapon Desc'].isin(Weapon_most_common)].copy()  # copy to avoid SettingWithCopyWarning below
Weapon_most_common_ds['Vict Sex des'] = Weapon_most_common_ds['Vict Sex'].apply(lambda x: 'Male' if x==1 else 'Female' if x==0 else 'Unknown' )
Weapon_most_common_ds = Weapon_most_common_ds[Weapon_most_common_ds['Vict Sex des'].isin(['Male','Female'])]
Weapon_most_common_ds['Weapon Desc'] = Weapon_most_common_ds['Weapon Desc'].apply(lambda x: x[0:20] )
group = Weapon_most_common_ds.groupby(["Weapon Desc",'Hour','Date Occ Year','Vict Sex des'])['Weapon Desc'].agg(['count'])
group = group.reset_index(level=["Weapon Desc",'Hour','Date Occ Year','Vict Sex des'])
px.scatter(group, x="Hour", y="count", animation_frame="Date Occ Year",  # animation_group="Vict Sex des",
           size="count", color="Weapon Desc", title='Case Counts for Female and Male Victims by Weapon over Hours and Years', facet_col="Vict Sex des")
We can see that females are attacked more than men and that this trend stays the same through the years. In addition, there are many more attacks by hand than by other means, and more attacks at night than in daylight.
Note: in 2019 we see a decrease in the number of events compared to the other years, because the year is not yet over (the same behavior appears in the next plots too).
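One way to make the partial 2019 comparable to the full years is to divide each year's event count by the number of months observed in it; a minimal sketch on toy data (real code would group `data` the same way):

```python
import pandas as pd

# Toy events: 12 in a fully observed year, 6 in a half-observed year.
df = pd.DataFrame({'Year': [2018] * 12 + [2019] * 6,
                   'Month': list(range(1, 13)) + list(range(1, 7))})
events_per_month = df.groupby('Year').size() / df.groupby('Year')['Month'].nunique()
print(events_per_month)  # 1.0 for both years once normalized
```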
group = data.groupby(["Status Desc",'Date Occ Year'])['Status Desc'].agg(['count'])
group = group.reset_index(level=["Status Desc",'Date Occ Year'])
px.bar(group, x="Status Desc", y="count", animation_frame="Date Occ Year", color="Status Desc",
title='Case Status Number Over the Years')
We can see that there are many more open investigations than other case statuses. In addition, the adult arrest count is higher than the juvenile arrest count, and the trend stays the same over the years.
The LAPD has 21 Community Police Stations referred to as Geographic Areas within the department. We will try to reveal some insights about those areas.
We first present a heat-map of the crime events in LA over time in an interactive map (using Folium). The map also contains markers denoting the locations of the 21 LAPD police stations, using a geojson with the stations coordinates. Hover over a marker to see the name of the station's division. Press the Play button to see the heat-map progress over time (month/year).
import folium
from sklearn.utils import shuffle
from folium.plugins import HeatMapWithTime
m = folium.Map(location=[34.022, -118.2437], zoom_start = 10.47) # , tiles = tiles
data_location = data[['LAT','LON','Date Occ Only']]
data_location = shuffle(data_location)[:66000]
data_location['Year'] = data_location['Date Occ Only'].apply(lambda x: x.year)
data_location['Month'] = data_location['Date Occ Only'].apply(lambda x: x.month)
data_location['time_lapse'] = data_location.apply(lambda x: x['Year']+x['Month']/100, axis=1)
data_l = [data_location[data_location['time_lapse']==data_location['time_lapse'].unique()[i]][['LAT','LON']].values.tolist()
for i in range(len(data_location['time_lapse'].unique()))]
index = [str(int(round((i%1)*100)))+'/'+str(int(i-(i%1))) for i in sorted(data_location['time_lapse'].unique())]
HeatMapWithTime(data_l, index=index, radius=6, auto_play=True).add_to(m)
LAPD_stations = gpd.read_file(r'LAPD_Police_Stations.geojson')
folium.GeoJson(LAPD_stations, name='DIVISION', tooltip=folium.features.GeoJsonTooltip(fields=['DIVISION'])).add_to(m)
m
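The `year + month/100` float trick used to build `time_lapse` and its `index` labels works, but a monthly `pd.Period` avoids the string round-trip and sorts chronologically by construction; a hedged alternative sketch on toy dates:

```python
import pandas as pd

# Toy occurrence dates; to_period('M') buckets them by month.
dates = pd.Series(pd.to_datetime(['2010-01-15', '2010-12-03', '2011-02-20']))
periods = dates.dt.to_period('M')
labels = [p.strftime('%m/%Y') for p in sorted(periods.unique())]
print(labels)  # ['01/2010', '12/2010', '02/2011']
```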
In this map we visualize the number of incidents that occurred in the different areas of LA (the 21 LAPD division areas). The areas' polygons were obtained from a GeoJSON file of the LAPD division areas (obtained from here). Hover over each area to see its name and its crime count.
LAPD_AREAS_gdf = gpd.read_file(r'lapd_divisions.json')
data = data.replace('N Hollywood', 'North Hollywood')
data = data.replace('West LA', 'West Los Angeles')
LAPD_AREAS_gdf['style'] = [
{'fillColor': [0,0,0],
'fillOpacity': 0.0,
'weight': 0.2,
'color': 'black'}]*len(LAPD_AREAS_gdf)
group_area = data.groupby(['AREA NAME'])['AREA NAME'].agg(['count']).reset_index()
LAPD_AREAS_gdf['count'] = LAPD_AREAS_gdf.apply(lambda r: group_area[group_area['AREA NAME']==r['name']]['count'].values[0], axis=1)
la_geo = r'lapd_divisions.json'
m = folium.Map(location = [34.015, -118.26], zoom_start = 10)
a = folium.Choropleth(
geo_data = r'lapd_divisions.json',
fill_opacity = 0.7,
line_opacity = 0.2,
data = group_area,
key_on = 'feature.properties.name',
columns = ['AREA NAME', 'count'],
fill_color = 'OrRd',
name='Number of crimes',
legend_name = 'bi'
)
a.add_to(m)
folium.GeoJson(data=LAPD_AREAS_gdf,
name='LAPD DIV',smooth_factor=2,
style_function=lambda x: {'color':'black','fillColor':'transparent','weight':0.2},
tooltip=folium.GeoJsonTooltip(fields=['count', 'name'],
labels=False,
sticky=False),
highlight_function=lambda x: {'weight':0.6,'fillColor':'grey'}
).add_to(m)
m
We can see that the most troubled areas are Southwest and 77th Street. This can also be seen in the heat-map in the cell above.
We further analyze the data per area - by year, by hour of the day and by day of the week.
group = data.groupby(["Date Occ Year", 'AREA NAME'])['AREA NAME'].agg(['count'])
results_group = group.reset_index(level=['Date Occ Year', 'AREA NAME'])
group_area = data.groupby(['AREA NAME'])['AREA NAME'].agg(['count'])
results_group_area = group_area.reset_index(level=[ 'AREA NAME'])
results_group_area['Area Name Most Common'] = results_group_area[['count','AREA NAME']].apply(lambda x: x['AREA NAME'] if x['count']>75000 else 'Other', axis=1)
# results_group_area.sort_values('count', ascending= False).reset_index()
results_group = results_group.merge(results_group_area, on = 'AREA NAME')
f, ax = plt.subplots(figsize=(11.7, 8.27))
g = sns.lineplot(ax=ax,x="Date Occ Year", y="count_x", hue='Area Name Most Common',data=results_group)
g.set_title("Crime Events per Area and Year", fontsize=20)
g.set_ylabel("Event Count",fontsize=15)
g.set_xlabel("Year",fontsize=15)
g.tick_params(labelsize=10)
plt.xticks(np.arange(2010, 2020, step=1))
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
We can see that the number of crime events stayed almost the same in most areas, except in Central, where there was an increase in 2018. In addition, '77th Street' and 'Southwest' are the areas with the highest event counts per year.
group = data.groupby(["Hour", 'AREA NAME'])['AREA NAME'].agg(['count'])
results_group = group.reset_index(level=['Hour', 'AREA NAME'])
results_group = results_group.merge(results_group_area, on = 'AREA NAME')
f, ax = plt.subplots(figsize=(11.7, 8.27))
g = sns.lineplot(ax=ax,x="Hour", y="count_x", hue='Area Name Most Common',data=results_group)
g.set_title("Crime Events per Area and Hour", fontsize=20)
g.set_ylabel("Event Count",fontsize=15)
g.set_xlabel("Hour",fontsize=15)
g.tick_params(labelsize=10)
plt.xticks(np.arange(0, 24, step=1))
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
We can see a decrease in the number of events in the early morning and an increase toward noon.
group = data.groupby(['Week Day', 'AREA NAME'])['AREA NAME'].agg(['count'])
results_group = group.reset_index(level=['Week Day', 'AREA NAME'])
results_group = results_group.merge(results_group_area, on = 'AREA NAME')
f, ax = plt.subplots(figsize=(11.7, 8.27))
g = sns.lineplot(ax=ax,x='Week Day', y="count_x", hue='Area Name Most Common',data=results_group)
g.set_title("Crime Events per Area and Week Day", fontsize=20)
g.set_ylabel("Event Count",fontsize=15)
g.set_xlabel("Week Day",fontsize=15)
g.tick_params(labelsize=10)
days=["Sunday", "Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"]
plt.xticks(range(len(days)), days)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
We used a GeoJSON file that contains LA's neighborhood polygons as well as additional features for each neighborhood (obtained from here), and merged it with a table of estimated features of each neighborhood from the US Census (obtained from here). Using each crime's coordinates we assigned it to a neighborhood.
from geopandas import GeoDataFrame
neiborhoods_df = pd.read_csv('census-data-by-neighborhood-council.csv')
LA_neighborhoods_gdf = gpd.read_file(r'neighborhood_councils_losangeles.geojson')
LA_neighborhoods_gdf['geometry'] = LA_neighborhoods_gdf['geometry'].simplify(0.00005, preserve_topology=True)
LA_neighborhoods_gdf['style'] = [
{'fillColor': [0,0,0],
'fillOpacity': 0.0,
'weight': 0.2,
'color': 'black'}]*len(LA_neighborhoods_gdf)
group_area = data.groupby(['neighborhood'])['neighborhood'].agg(['count']).reset_index()
joined_df = LA_neighborhoods_gdf.set_index('name').join(group_area.set_index('neighborhood'))
neighb_gdf=joined_df.join(neiborhoods_df.set_index('NC_Name'))
neighb_gdf = neighb_gdf[['geometry', 'count', 'Total Population', 'In_Poverty', 'Owner_occ', 'Renter_occ']]
crs = {'init': 'epsg:4326'}
neighb_gdf = GeoDataFrame(neighb_gdf.reset_index(), crs=crs)
neighb_gdf.head()
Using the Total Population estimate of each neighborhood, we created a choropleth map of the crime count in each neighborhood relative to its population size. This way we hope to obtain a more objective view of crime counts around the city.
neighb_gdf['CrimesPerCapita'] = neighb_gdf.apply(lambda x: round(x['count']/x['Total Population'], 2), axis=1)
m = folium.Map(location = [34.015, -118.26], zoom_start = 10)
a = folium.Choropleth(
geo_data = neighb_gdf,
fill_opacity = 0.7,
line_opacity = 0.2,
data = neighb_gdf,
key_on = 'feature.properties.name',
columns = ['name', 'CrimesPerCapita'],
fill_color = 'OrRd',
name='Number of crimes',
legend_name = 'bi',
bins=8
)
a.add_to(m)
folium.GeoJson(data=neighb_gdf,
name='LAPD Neigh.',smooth_factor=2,
style_function=lambda x: {'color':'black','fillColor':'transparent','weight':0.2},
tooltip=folium.GeoJsonTooltip(fields=['CrimesPerCapita', 'name'],
labels=False,
sticky=False),
highlight_function=lambda x: {'weight':0.6,'fillColor':'grey'}
).add_to(m)
m
We can see that the most crime-stricken neighborhood in LA is Downtown LA.
Using the In_Poverty estimate of each neighborhood, we created a scatter plot of the number of crimes against the size of the poor population in the different neighborhoods.
plt.figure(figsize=(12, 7))
corr_df = pd.DataFrame(neighb_gdf[['name', 'count', 'Total Population', 'In_Poverty']])
g = sns.scatterplot(x='In_Poverty', y="count", data=corr_df)
g.set_title("Crimes Count in-respect-to Size of Poor Population", fontsize=20)
g.set_ylabel("Event Count",fontsize=15)
g.set_xlabel("Poor Population Number",fontsize=15)
It seems that there is a correlation between the size of a neighborhood's poor population and the amount of crime in it.
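The visual impression can be quantified with a Pearson correlation coefficient; a minimal sketch with made-up numbers (run `corr_df['In_Poverty'].corr(corr_df['count'])` for the real value):

```python
import pandas as pd

# Made-up neighborhood figures, roughly proportional on purpose.
toy = pd.DataFrame({'In_Poverty': [100, 400, 900, 1600, 2500],
                    'count': [150, 430, 870, 1700, 2450]})
r = toy['In_Poverty'].corr(toy['count'])  # Pearson by default
print(round(r, 3))                        # close to 1 for these numbers
```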
Each crime in the dataset is described by a set of features, including features about the victim, the premises where the crime took place, the coordinates of the event, the weapon used, the crime description and the time of the event. In addition, each crime is characterized by the Modus Operandi of the criminal, which refers to the methods the criminal used to commit the crime. These are free-text descriptions that can go through NLP text analysis and feature extraction. In the following section we present two tools that take advantage of these features to establish links between crimes using different similarity and clustering measures. The purpose of the tools is to help law enforcement agencies solve crimes more easily by finding crimes that were committed by the same criminals/gangs.
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
import turicreate as tc
import matplotlib.pyplot as plt
import nltk
nltk.download('punkt')
%matplotlib inline
The dataset contains only the codes of the modus operandi (MO) descriptions. In the following cells we load the actual MO descriptions and create a document of all the MOs for each crime. We then embed these documents with word2vec, using Gensim and a word2vec model pretrained on Google News.
data_cluster = data.rename(columns={'AREA ': 'AREA'}).reset_index(drop=True)
MO_Codes = pd.read_csv('MO_Codes.csv')
mo_codes_d = {}
for i, line in MO_Codes.iterrows():
    mo_codes_d[int(line['MO_Code'])] = line['Description']
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
from nltk.tokenize import word_tokenize

def txt2vector(txt):
    """Average the word2vec vectors of txt's in-vocabulary words (None if there are none)."""
    words = word_tokenize(txt)
    words = [w for w in words if w in model]
    if len(words) != 0:
        return np.mean([model[w] for w in words], axis=0)
    return None
data_cluster = data_cluster[['AREA NAME', 'Crm Cd Desc', 'Mocodes',
'Vict Age', 'Vict Sex', 'Vict Descent','Crm Cd 1', 'Crm Cd 2', 'Crm Cd 3', 'Weapon Desc', 'Premis Desc',
'LAT', 'LON', 'Date Occ Year', 'Month', 'Hour', 'Week Day', 'neighborhood']]
import numpy as np
def from_codes_to_desc(x):
    """Concatenate the textual descriptions of a list of MO codes into a single string."""
    str_r = ''
    for i in x:
        # skip NaN ('i == i' is False for NaN) and empty/placeholder tokens
        if i is not np.nan and i == i and i != 'nan' and i != '-' and i != '':
            if int(i) in mo_codes_d:
                str_r = str_r + ' ' + mo_codes_d[int(i)]
    return str_r
data_cluster['MO_Description'] = data_cluster['Mocodes'].apply(lambda x: from_codes_to_desc(str(x).split(' ')))
from tqdm import tqdm
l = []
crm_codes = ['Crm Cd 2', 'Crm Cd 3']
data_cluster = data_cluster.reset_index(drop=True)
for i in tqdm(range(len(data_cluster))):
    text = data_cluster.iloc[i]["MO_Description"]
    text += data_cluster.iloc[i]["Crm Cd Desc"]
    # append the descriptions of the secondary crime codes, when present
    for crm in data_cluster.iloc[i][crm_codes]:
        if not pd.isnull(crm):
            desc = data[data['Crm Cd'] == crm]
            if desc.shape[0] > 0:
                crm_desc = desc.iloc[0]['Crm Cd Desc']
                text += crm_desc
    l.append(txt2vector(text))
data_cluster['word2vec_MO_codes_Description'] = l
data_cluster.to_csv('final_data.csv')
data_cluster = tc.SFrame(data='final_data.csv')
We want to create a vector representation for each crime in the dataset. One part of the vector is the dense word vector (of length 300) that we just created. The other part consists of the features in the dataset. Most of these features are categorical with many categories, so after one-hot encoding the resulting vectors are quite sparse and difficult to interpret. For that reason, after performing one-hot encoding we pass the vectors through a dimensionality reduction module (PCA) to obtain a denser representation. We concatenate this dense vector to the word vector to receive the final vector that represents the crime.
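As a rough illustration of this pipeline (one-hot encoding, PCA densification, concatenation with the word vectors), here is a minimal sketch on hypothetical toy data; the column names and dimensions are made up for the example and are much smaller than in the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Toy categorical frame standing in for the crime features (hypothetical values).
toy = pd.DataFrame({
    "area": ["Hollywood", "Central", "Hollywood", "77th Street"],
    "weapon": ["GUN", "KNIFE", "GUN", "UNKNOWN"],
})

one_hot = pd.get_dummies(toy)   # sparse 0/1 matrix, one column per category
pca = PCA(n_components=2)       # densify into 2 components (10 in the real pipeline)
dense = pca.fit_transform(one_hot)

# Stand-in 4-d word vectors for each row (the real MO vectors are 300-d).
word_vecs = np.random.rand(len(toy), 4)
crime_vectors = np.hstack([dense, word_vecs])
print(crime_vectors.shape)  # (4, 6)
```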
data_cluster = data_cluster[['AREA NAME', 'Vict Age', 'Vict Sex', 'Vict Descent', 'LAT', 'LON', 'Date Occ Year', 'Month', 'Hour', 'Week Day', 'neighborhood',
'word2vec_MO_codes_Description', 'MO_Description', 'Crm Cd Desc', 'Weapon Desc', 'Premis Desc']]
data_cluster = data_cluster.fillna('Weapon Desc', "UNKNOWN")
data_cluster = data_cluster.fillna('Premis Desc', "UNKNOWN")
data_cluster = data_cluster.dropna()
from tqdm import tqdm
X = []
df = data_cluster.to_dataframe()
data_cluster['mo_arrays'] = df['word2vec_MO_codes_Description']
df = df.rename(columns={'word2vec_MO_codes_Description': 'mo_arrays'})
df_pca = df[['AREA NAME', 'Vict Age', 'Vict Sex', 'Vict Descent', 'LAT', 'LON', 'Date Occ Year', 'Hour', 'Week Day',
'neighborhood', 'Weapon Desc', 'Premis Desc']]
data_pca = pd.get_dummies(df_pca)
from sklearn.decomposition import PCA
pca = PCA(n_components=10)
sf_pca = tc.SFrame(pca.fit_transform(data_pca))
df['vectors'] = data_cluster['vectors'] = sf_pca['X1']
After obtaining the vectors representing the crimes, we can use them to create a nearest-neighbors model. The model can be used to find the k most similar crimes (neighbors) of a given crime and thus look for crimes committed by the same criminals. We used TuriCreate's nearest_neighbors module:
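As a self-contained sketch of the idea (using scikit-learn's NearestNeighbors in place of TuriCreate's model; the 5-d vectors below are hypothetical stand-ins for the crime vectors):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical crime vectors; row 0 plays the role of the queried crime.
vectors = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.0, 0.0],   # nearly identical crime
    [5.0, 5.0, 5.0, 5.0, 5.0],   # very different crime
])

nn = NearestNeighbors(n_neighbors=2).fit(vectors)
dist, idx = nn.kneighbors(vectors[0:1])
print(idx[0])  # the query itself first, then its closest neighbor: [0 1]
```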
knn_model = tc.nearest_neighbors.create(tc.SFrame(data_cluster['vectors', 'mo_arrays']))
Let's use the model on crime number 5 in the dataset as an example. Here are this crime's details:
features = ['AREA NAME', 'Crm Cd Desc', 'Vict Age', 'Vict Sex', 'Vict Descent', 'LAT', 'LON', 'Date Occ Year', 'Month',
'Hour', 'Week Day', 'neighborhood', 'MO_Description', 'Weapon Desc', 'Premis Desc']
data_cluster[features][5]
query = knn_model.query(tc.SFrame(data_cluster['vectors', 'mo_arrays'])[5:6], k=10)
query
We can see that crime number 213411 is the most similar to crime 5. In the following cell we can see that this crime is indeed similar to crime number 5:
data_cluster[features][213411]
The first tool we present is the display_neighbors function: given the index (id) of the queried crime and the requested number of neighbors (k), it displays on a Plotly interactive map of LA the k most similar crimes that were reported in the past. Each crime is represented as a circle whose size is determined by its distance to the queried crime: the more similar the crime, the larger the circle. The details of each crime are visible when hovering over the circle. In the following cells we display 2 use-cases as examples.
from plotly_express import ExpressFigure
def display_neighbors(data, crime_idx, knn_model=None, k=10):
    """
    Plots an interactive map with the queried crime and the k most similar crimes to it.
    Args:
        param1: data.
        param2: crime_idx. the index (id) of the crime you want to query.
        param3: knn_model. a trained knn model, if exists.
        param4: k. the number of neighbors to display on the map.
    Returns:
        fig. the interactive map.
    """
    if not knn_model:
        knn_model = tc.nearest_neighbors.create(tc.SFrame(data[['vectors', 'mo_arrays']]))
    model_vectors = tc.SFrame(data[['vectors', 'mo_arrays']])
    query = knn_model.query(model_vectors[crime_idx:crime_idx+1], k=k)
    px.set_mapbox_access_token('pk.eyJ1IjoiY2hyaWRkeXAiLCJhIjoiY2ozcGI1MTZ3MDBpcTJ3cXR4b3owdDQwaCJ9.8jpMunbKjdq1anXwU5gxIw')
    df_neihg = data.iloc[list(query['reference_label'])].reset_index(drop=True)
    l = list(query['rank'])
    l.reverse()
    df_neihg['size'] = [i/k for i in l]
    df_neihg['queried crime'] = ['NO']*k
    df_neihg.loc[0, 'queried crime'] = 'Yes'
    features = ['AREA NAME', 'Crm Cd Desc', 'Vict Age', 'Vict Sex', 'Vict Descent', 'Date Occ Year', 'Month',
                'Hour', 'Week Day', 'neighborhood', 'MO_Description', 'Premis Desc', 'Weapon Desc']
    fig1 = px.scatter_mapbox(df_neihg, lat="LAT", lon="LON", size='size', color="queried crime",
                             color_discrete_sequence=px.colors.qualitative.Alphabet, size_max=k, zoom=9.25, hover_data=features)
    lo = fig1.layout
    lo['title'] = f'Crimes That are Related to Crime No. {crime_idx}'
    fig = ExpressFigure(data=list(fig1.data), layout=lo)
    return fig
display_neighbors(df, crime_idx=15, knn_model=knn_model, k=15)
display_neighbors(df, crime_idx=1600000, knn_model=knn_model, k=8)
The second tool we present is the display_clusters function: given a crime type (e.g., Theft), an LAPD area, a year and months of that year, it produces clusters of crimes that occurred in that area at those times and visualizes them on a Plotly interactive map. Similarly to what we presented earlier in this section, the relevant examples of the data (after filtering by area and time) go through PCA dimensionality reduction along with the MO documents' word vectors. The function's consider_victim parameter lets the user decide whether to use features about the victim. For example, in car theft the victim's features might be less relevant, so the user might choose not to use them. For creating the clusters we used 2 different methods that can be chosen via the c_type parameter:
"k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells" (Wikipedia). When the "kmeans" option is chosen, a TuriCreate k-means model is constructed and the data is partitioned into n clusters (defined by the n_clusters parameter, which should be chosen by the user).
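A minimal sketch of what the k-means option does, using scikit-learn's KMeans on hypothetical toy points instead of TuriCreate:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of 2-d points standing in for crime vectors.
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [9.0, 9.0], [9.1, 9.2], [8.9, 9.1]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = km.labels_
# The first three points share one cluster id, the last three the other.
print(labels)
```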
If the chosen type is "similarity_graph", a nearest-neighbors model is created with TuriCreate and used to construct a similarity graph of the crimes. The SGraph object is transformed into a NetworkX graph, which is partitioned into communities with the help of the greedy_modularity_communities module. The crimes of the different communities are visualized over an interactive map. In addition, the NetworkX graph is transformed once again into a pyvis graph, so that the nodes can be plotted on an interactive graph that displays the information of each node (crime) when hovering over it and colors each node according to its community.
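A minimal sketch of the community-detection step on a toy graph (standing in for the crime similarity graph):

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Two tightly knit triangles joined by a single edge.
g = nx.Graph()
g.add_edges_from([(0, 1), (1, 2), (0, 2),   # community A
                  (3, 4), (4, 5), (3, 5),   # community B
                  (2, 3)])                  # weak link between them

# Greedy modularity maximization splits the graph along the weak link,
# recovering the two triangles: {0, 1, 2} and {3, 4, 5}.
communities = [sorted(c) for c in greedy_modularity_communities(g)]
print(communities)
```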
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from pyvis.network import Network
def display_clusters(data, consider_victim=True, c_type='kmeans', crime_type='BURGLARY', area='Hollywood', year=2015, months=[3, 4, 5], n_clusters=6):
    """
    Plots an interactive map with the clusters of crimes (and a graph in some cases).
    Args:
        param1: data.
        param2: consider_victim. Boolean to decide whether to use features of the victims or not.
        param3: c_type. str to determine the method to use for clustering ('kmeans'/'similarity_graph').
        param4: crime_type. str to decide which crime types to focus on.
        param5: area. LAPD division area to focus on.
        param6: year. which year to filter the data by.
        param7: months. which months of that year to filter the data by.
        param8: n_clusters. number of clusters to use (only when using kmeans).
    Returns:
        fig1, the interactive map and G, the interactive graph plot (only on similarity_graph).
    """
    features = ['LAT', 'LON', 'Date Occ Year', 'Hour', 'Week Day', 'neighborhood', 'Weapon Desc', 'Premis Desc']
    if consider_victim:
        features += ['Vict Age', 'Vict Sex', 'Vict Descent']
    disp_features = ['Vict Age', 'Vict Sex', 'Vict Descent', 'Hour', 'Week Day', 'neighborhood', 'MO_Description',
                     'Premis Desc', 'Weapon Desc']
    # filter by crime type, area and time
    crimes = [i for i in data['Crm Cd Desc'].unique().tolist() if crime_type in i]
    data = data[data['Crm Cd Desc'].isin(crimes)]
    data = data[data['AREA NAME'] == area]
    data = data[data['Date Occ Year'] == year]
    data = data[data['Month'].isin(months)]
    data = data.reset_index(drop=True)
    data = data.rename(columns={'word2vec_MO_codes_Description': 'mo_arrays'})
    df_pca = data[features]
    data_pca = pd.get_dummies(df_pca)
    pca = PCA(n_components=10)
    sf_pca = tc.SFrame(pca.fit_transform(data_pca))
    data['vectors'] = sf_pca['X1']
    if c_type == 'kmeans':
        kmeans_model = tc.kmeans.create(tc.SFrame(data[['vectors', 'mo_arrays']]), num_clusters=n_clusters)
        data['row_id'] = range(0, data.shape[0])
        data = data.merge(kmeans_model.cluster_id.to_dataframe(), on='row_id')
        G = None
    elif c_type == 'similarity_graph':
        knn_model = tc.nearest_neighbors.create(tc.SFrame(data[['vectors', 'mo_arrays']]))
        sg = knn_model.similarity_graph(k=5)
        g = nx.Graph()
        G = Network(notebook=True, height='500px', width='1100px')
        for v in sg.vertices['__id']:
            g.add_node(v, attr_dict=dict(zip(disp_features, data.iloc[v][disp_features])))
            G.add_node(v, title=str(dict(zip(disp_features, data.iloc[v][disp_features]))))
        for e in sg.edges:
            g.add_edge(e["__src_id"], e["__dst_id"])
            G.add_edge(e["__src_id"], e["__dst_id"])
        c = list(greedy_modularity_communities(g))
        data['cluster_id'] = np.zeros(len(data))
        print(len(c))
        for i in range(len(c)):
            data.loc[list(c[i]), 'cluster_id'] = i
        for n in G.nodes:
            n['group'] = data['cluster_id'].loc[n['id']]
    px.set_mapbox_access_token('pk.eyJ1IjoiY2hyaWRkeXAiLCJhIjoiY2ozcGI1MTZ3MDBpcTJ3cXR4b3owdDQwaCJ9.8jpMunbKjdq1anXwU5gxIw')
    data['cluster_id'] = data['cluster_id'].astype(str)
    fig1 = px.scatter_mapbox(data, lat="LAT", lon="LON", color="cluster_id",
                             size_max=10, zoom=12.25, hover_data=disp_features)
    lo = fig1.layout
    lo['title'] = f'{crime_type} Crimes Clusters in Months {", ".join(str(x) for x in months)} of {year} at the LAPD Area of {area} using the {c_type} method'
    lo.font.size = 10
    fig = ExpressFigure(data=list(fig1.data), layout=lo)
    return fig1, G
We use k-means to display the clusters of "BURGLARY FROM VEHICLE" crimes that took place between March and May of 2015 in the Hollywood area. Hover over the points to see the crime information. If you wish to look at only one crime cluster at a time, double-click the cluster you want in the legend on the upper right-hand side of the map.
fig, _ = display_clusters(df.iloc[:,:-1], consider_victim=False, c_type='kmeans', crime_type='BURGLARY FROM VEHICLE', area='Hollywood')
fig
We use k-means to display the clusters of "ASSAULT" crimes that took place in June and July of 2010 in the Central area. We use 12 clusters this time and take the victims' features into consideration. Hover over the points to see the crime information. If you wish to look at only one crime cluster at a time, double-click the cluster you want in the legend on the upper right-hand side of the map.
f, _ = display_clusters(df.iloc[:,:-1], consider_victim=True, c_type='kmeans', crime_type='ASSAULT', area='Central', n_clusters=12,
year=2010, months=[6,7])
f
We use a similarity graph to display the clusters of "RAPE" crimes that took place between January and May of 2012 in the 77th Street area. We take the victims' features into consideration. Hover over the points to see the crime information. Here we also plot the resulting similarity graph interactively, where each node color represents a different community. Hover over the nodes of the graph to see the details of the crimes.
f, g = display_clusters(df.iloc[:,:-1], consider_victim=True, c_type='similarity_graph', crime_type='RAPE', area='77th Street',
year=2012, months=[1, 2, 3, 4, 5])
f
g.show('f.html')
We use a similarity graph to display the clusters of "HOMICIDE" crimes that took place between January and October of 2018 in the Central area. We take the victims' features into consideration. Hover over the points to see the crime information. Here we also plot the resulting similarity graph interactively, where each node color represents a different community. Hover over the nodes of the graph to see the details of the crimes.
f, g = display_clusters(df.iloc[:,:-1], consider_victim=True, c_type='similarity_graph', crime_type='HOMICIDE', area='Central', n_clusters=12,
year=2018, months=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
f
g.show('g.html')
First, we need to understand what percentage of crimes have an unknown or unreported weapon:
df['Weapon Desc'] = df['Weapon Desc'].replace('', 'UNKNOWN')
df['Weapon unknown'] = df['Weapon Desc'].str.contains('UNKNOWN|unknown|unkn', regex=True)
Weapon_unknown = df[['Weapon unknown']]
group = Weapon_unknown.groupby(["Weapon unknown"])['Weapon unknown'].agg(['count'])
group = group.reset_index(level=['Weapon unknown'])
group = group.sort_values("count", ascending=False)
group.plot.pie(y='count', figsize=(5, 5),autopct='%1.0f%%', textprops={'fontsize': 14})
In 35% of the crimes the weapon used by the gang/criminal is unknown. We will try to build a classifier that predicts which weapon was used; this can assist the police in solving crimes. We will drop the rows in which the weapon is unknown and group the weapons into 5 main categories.
data_initial = pd.read_csv("Crime_Data_from_2010_to_Present.csv")
keys = data_initial['Weapon Desc'].unique().tolist()[1:]
weapons_dict = {}
for key in keys:
    val = data_initial[data_initial['Weapon Desc'] == key]['Weapon Used Cd'].reset_index(drop=True)[0]
    weapons_dict[key] = val
df = df[df['Weapon unknown']==0]
df['Weapon Used Cd'] = df['Weapon Desc'].apply(lambda x: weapons_dict[x])
df['Target'] = df['Weapon Used Cd'].apply(lambda x: 'Gun' if x<150 else 'Knife' if 199<x<250 else 'Bat' if 299<x<350
else 'Arm' if 399<x<450 else 'Other Weapon')
Then, we will build a baseline classifier, constructing features using NLP tools, and try to improve it by adding a clustering feature (the cluster that the crime belongs to) that should express the connections between the crimes' characteristics.
We use the Modus Operandi text document and the crime description, and apply N-grams and bag-of-words to extract features from the free text.
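To illustrate the kind of features this produces, here is a minimal pure-Python sketch of word n-gram counting, a simplified analogue of TuriCreate's count_ngrams/count_words (the count_ngrams helper and the example MO text below are hypothetical):

```python
from collections import Counter

def count_ngrams(text, n=1):
    """Count word n-grams in a text (simplified sketch of
    turicreate.text_analytics.count_ngrams)."""
    words = text.lower().split()
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return dict(Counter(grams))

mo = "suspect fled on foot suspect armed"
print(count_ngrams(mo, n=1))  # unigram counts, e.g. {'suspect': 2, 'fled': 1, ...}
print(count_ngrams(mo, n=2))  # bigram counts, e.g. {'suspect fled': 1, ...}
```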
sf_cluster = tc.SFrame(df)
sf_cluster['words_1grams_MO_Description'] = tc.text_analytics.count_ngrams(sf_cluster['MO_Description'], n=1, method='word')
sf_cluster['words_2grams_MO_Description'] = tc.text_analytics.count_ngrams(sf_cluster['MO_Description'], n=2, method='word')
sf_cluster['words_dict_MO_Description'] = tc.text_analytics.count_words(sf_cluster['MO_Description'])
sf_cluster['words_1grams_Crm_Cd'] = tc.text_analytics.count_ngrams(sf_cluster['Crm Cd Desc'], n=1, method='word')
sf_cluster['words_2grams_Crm_Cd'] = tc.text_analytics.count_ngrams(sf_cluster['Crm Cd Desc'], n=2, method='word')
sf_cluster['words_dict_Crm_Cd'] = tc.text_analytics.count_words(sf_cluster['Crm Cd Desc'])
Running the baseline model:
train, test = sf_cluster.random_split(0.8)
cls_base_line = tc.classifier.create(train, features=['words_1grams_MO_Description','words_2grams_MO_Description','words_dict_MO_Description','words_1grams_Crm_Cd',
'words_2grams_Crm_Cd','words_dict_Crm_Cd','Vict Age', 'Vict Sex', 'Vict Descent','LAT', 'LON', 'Date Occ Year', 'Week Day','Hour'], target="Target")
Evaluating the baseline model on the test set:
results_base_line = cls_base_line.evaluate(test)
results_base_line
Results: the random baseline is 0.2 accuracy (5 classes) and our baseline model achieved 0.887. Now let's try to improve it by adding a clustering feature.
Decide on the number of clusters using the elbow method.
df = df.reset_index(drop=True)
np_data = np.array([df['mo_arrays'].values[i].tolist() + df['vectors'].values[i].tolist() for i in range(df.shape[0])])
np_data.shape
# calculate distortion for a range of numbers of clusters
distortions = []
for i in range(100, 800, 50):
    km = KMeans(
        n_clusters=i, init='random',
        n_init=10, max_iter=300,
        tol=1e-04, random_state=0, verbose=False
    )
    km.fit(np_data)
    distortions.append(km.inertia_)
# plot
plt.plot(range(100, 800, 50), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.show()
We choose 350 clusters according to this graph and create the k-means model:
kmeans_model = tc.kmeans.create(data_cluster['vectors', 'mo_arrays'], num_clusters=350)
kmeans_model.summary()
kmeans_model.cluster_info.print_rows(num_columns=5, max_row_width=80, max_column_width=10)
kmeans_model.cluster_info[['cluster_id', 'size', 'sum_squared_distance']]
kmeans_model.cluster_id.head()
Adding a column to the dataset with the cluster each crime belongs to:
df['row_id'] = range(0,df.shape[0])
df_joined = df.merge(kmeans_model.cluster_id.to_dataframe(), on = 'row_id')
Now we add the cluster (and its distance to the cluster center) as features for the weapon classification task and check whether the classification results improve.
train, test = tc.SFrame(df_joined).random_split(0.8)
cls_with_cluster = tc.classifier.create(train,features=['words_1grams_MO_Description','words_2grams_MO_Description','words_dict_MO_Description','words_1grams_Crm_Cd',
'words_2grams_Crm_Cd','words_dict_Crm_Cd','Vict Age', 'Vict Sex', 'Vict Descent','LAT', 'LON', 'Date Occ Year', 'Week Day','cluster_id','distance','Hour'], target="Target")
Evaluating the improved classifier on the test set:
results_with_cluster = cls_with_cluster.evaluate(test)
results_with_cluster
We can see that neither the accuracy nor the AUC improved over the baseline results. Thus, we shouldn't use the clustering features.
Exploring the data, we revealed what are the most common crime types in Los-Angeles and in which areas of LA each type occurs the most. We analyzed the data regarding the victims of the crimes and showed that most victims are in their twenties and that there is a strong correlation between the location of the crime and the descent of the victim. We also analyzed the weapons that were used and which are more common in each hour of the day. With heat-map and choropleth maps we displayed which areas of LA are more crime-stricken and presented how the time (hour of the day, day of week) influences the amount of crime in the different neighborhoods and showed in which neighborhoods the number of crimes increased over the recent years and in which areas it stayed the same.
Using NLP, dimensionality reduction methods and more, we constructed tools that can assist law enforcement agencies in finding relations and links between crime events. We used the nearest-neighbors algorithm to display on a map the most similar crimes to an investigated crime. We also used k-means and community detection in a similarity graph to divide crimes into clusters and displayed them on a map.
Finally, using NLP feature extraction tools we developed a classifier to predict the weapon that was used in a crime. This classifier can help complete investigations of crimes in which the investigators don't know which weapon was used.